Atom AI Labs - AI-Powered Multi-Tenant Platform

ATOM SaaS Production Runbook

**Last Updated:** 2026-02-22

**Platform Version:** v2.3

**Environment:** Production (ATOM Cloud)

**Target Audience:** DevOps Engineers, Site Reliability Engineers, On-Call Engineers

---

Overview
Production Architecture
Deployment Procedures
Monitoring & Observability
Incident Response
Common Issues & Resolutions
Data Backup & Recovery
Security & Compliance
Maintenance Windows
Emergency Contacts

---

Overview

Platform Components

**ATOM SaaS** is a multi-tenant AI agent platform deployed on ATOM Cloud with the following components:

**web-platform** - Main production app (Next.js + Python backend)

URL: https://[tenant].atomagentos.com
Components: Next.js frontend (port 3000) + Python FastAPI backend (port 8000)
Resources: 1GB RAM, 1 CPU (shared)
Nodes: 1 minimum (auto-scaling enabled)

**api-service** - Dedicated Python backend API

URL: https://[tenant].atomagentos.com/api
Components: Python FastAPI only (port 8000)
Resources: 1GB RAM, 1 CPU (shared)
Nodes: 2 (rolling deployments)

**Database** - Neon PostgreSQL

Managed database service with connection pooling
Automatic backups and point-in-time recovery

**Redis** - Upstash Redis

URL format: https://*.upstash.io
Used for: rate limiting, caching, session storage

**Storage** - AWS S3

Tenant-isolated storage (s3://atom-saas/{tenant_id}/)
Used for: file uploads, canvas assets, agent artifacts
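
The tenant prefix layout above can be enforced when object keys are built. The following is a minimal sketch, not the platform's actual storage code; the helper names `tenant_object_key` and `tenant_object_uri` are hypothetical, and only the bucket name and path shape come from this runbook.

```python
import posixpath

BUCKET = "atom-saas"  # bucket name taken from the path shown in this runbook


def tenant_object_key(tenant_id: str, relative_path: str) -> str:
    """Build an S3 object key that stays inside the tenant's own prefix."""
    if not tenant_id or "/" in tenant_id or tenant_id in (".", ".."):
        raise ValueError(f"invalid tenant_id: {tenant_id!r}")
    # Normalise against a fake root so '..' segments cannot climb out of the prefix.
    cleaned = posixpath.normpath("/" + relative_path).lstrip("/")
    if not cleaned:
        raise ValueError("relative_path must name an object")
    return f"{tenant_id}/{cleaned}"


def tenant_object_uri(tenant_id: str, relative_path: str) -> str:
    """Return the full s3:// URI for a tenant-scoped object."""
    return f"s3://{BUCKET}/{tenant_object_key(tenant_id, relative_path)}"
```

Normalising before prefixing means a crafted path such as `../../other/file.pdf` still resolves under the caller's own tenant prefix.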

Key Technologies

**Frontend:** Next.js 14, React 18, TypeScript, Tailwind CSS
**Backend:** Python 3.11+, FastAPI, SQLAlchemy, Alembic
**Database:** PostgreSQL with Row-Level Security (RLS)
**Deployment:** ATOM Cloud with Docker containers
**Monitoring:** Cloud metrics, logs, health checks

---

Production Architecture

Application Deployment Strategy

The platform uses a **dual-app deployment strategy** to separate web and AI workloads:

┌─────────────────────────────────────────────────────────────┐
│                     web-platform (Main)                     │
│  ┌────────────────────────┐     ┌────────────────────────┐  │
│  │  Next.js               │     │  Python (ROLE=web)     │  │
│  │  Port 3000             │     │  Port 8000             │  │
│  │  153+ API routes       │     │  Brain systems         │  │
│  └────────────────────────┘     └────────────────────────┘  │
│               │                              │              │
└───────────────┼──────────────────────────────┼──────────────┘
                │                              │
                ▼                              ▼
         ┌─────────────┐                ┌─────────────┐
         │    S3       │                │   Neon DB   │
         └─────────────┘                └─────────────┘

┌─────────────────────────────────────────────────────────────┐
│                    api-service (Backend)                    │
│  ┌────────────────────────┐                                 │
│  │  Python (ROLE=api)     │                                 │
│  │  Port 8000             │                                 │
│  │  LLM processing,       │                                 │
│  │  embeddings, reasoning │                                 │
│  └────────────────────────┘                                 │
│               │                                             │
└───────────────┼─────────────────────────────────────────────┘
                │
                ▼
         ┌─────────────┐                ┌─────────────┐
         │   Upstash   │                │   Neon DB   │
         │   Redis     │                └─────────────┘
         └─────────────┘

Environment Variables

**Critical secrets** (managed via atom-cli secrets):

# Database
DATABASE_URL=postgresql://...

# Authentication
NEXTAUTH_SECRET=...
NEXTAUTH_URL=https://[tenant].atomagentos.com

# LLM Providers (BYOK)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# External Services
REDIS_URL=https://*.upstash.io
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
STRIPE_SECRET_KEY=sk_live_...

# Email (SES)
SES_AWS_ACCESS_KEY_ID=...
SES_AWS_SECRET_ACCESS_KEY=...
SES_REGION=us-east-1

Health Check Endpoints

**Main App (web-platform):**

GET /api/health - Full health check (DB, Redis, services)
Health check interval: 15s
Grace period: 30s
Timeout: 10s

**Backend API (api-service):**

GET /alive - Simple liveness (no DB required)
GET /health - Full health check (DB, Redis, LLM providers)
Health check interval: 30s
Grace period: 90s
Timeout: 10s
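
The split between a liveness probe and a full health check can be sketched as follows. This is an illustration under assumptions, not the actual endpoint code: `run_health_checks` and `alive` are hypothetical names, and each dependency probe is assumed to be a zero-argument callable returning a boolean.

```python
from typing import Callable, Dict, Tuple


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every dependency probe and return (http_status, response_body)."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = "ok" if probe() else "failed"
        except Exception as exc:  # a probe that raises counts as a failure
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), body


def alive() -> Tuple[int, dict]:
    """Liveness probe: touches no dependency, so it stays green during a DB outage."""
    return 200, {"status": "alive"}
```

Keeping `/alive` free of dependencies lets the platform distinguish a crashed process (restart it) from a failing dependency (restarting will not help).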

---

Deployment Procedures

Prerequisites Checklist

Before deploying to production, verify:

[ ] All tests passing locally (npm test && cd backend-saas && pytest)
[ ] No critical security vulnerabilities (npm audit --audit-level=high)
[ ] Database migrations tested locally (alembic upgrade head)
[ ] Environment variables documented
[ ] Staging environment validated (if available)
[ ] Backup created before major changes
[ ] Team notified of deployment
[ ] Rollback plan documented

Deployment: Main App (web-platform)

**Standard Deployment:**

# From repository root
atom-cli deploy

# With specific Dockerfile
atom-cli deploy --dockerfile Dockerfile

# Check deployment status
atom-cli status

**Deployment Process:**

Code pushed to main branch
atom-cli deploy triggers build
Depot builder creates Docker image (cached layers)
Release command runs migrations (./backend-saas/scripts/run_migrations.sh)
Rolling deployment updates machines (zero downtime)
Health checks validate service availability
New version receives production traffic

**Expected Duration:** 3-5 minutes

**What Happens During Deployment:**

Docker image built (cached layers speed this up)
Database migrations run automatically
Next.js frontend builds (production optimized)
Python backend starts with ROLE=web
Health checks validate all services
Old machines replaced one-by-one (rolling update)

Deployment: Backend API (api-service)

**Standard Deployment:**

# From backend-saas directory
cd backend-saas
atom-cli deploy --config infrastructure.config

# Alternative from root
atom-cli deploy --dockerfile backend-saas/Dockerfile.api

**Deployment Process:**

Code pushed to main branch
atom-cli deploy triggers API-only build
Docker image built (Dockerfile.api)
Migrations run during startup (lifespan function)
Rolling deployment to 2 machines
Health checks validate Python backend
New version receives API traffic

**Expected Duration:** 2-4 minutes

**Key Differences from Main App:**

Uses Dockerfile.api (Python-only build)
ROLE=api environment variable
Migrations run in lifespan() (not release_command)
Auto-stop when idle (cost optimization)
2 machines for rolling deployments

Post-Deployment Verification

After deployment completes, verify:

# 1. Check app status
atom-cli status

# 2. Verify health endpoints
curl https://[tenant].atomagentos.com/api/health
curl https://[tenant].atomagentos.com/api/alive

# 3. Check node status
atom-cli nodes list

# 4. View recent logs
atom-cli logs --lines 50

**Expected Results:**

Health endpoints return 200 OK
Machines show "running" state
No errors in recent logs
Critical paths functional (auth, agents, skills)
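
The health endpoint check can be automated by polling until it returns 200. A minimal sketch using only the standard library; the function names, the retry count, and the 15-second delay are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay: float = 15.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it answers 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

`fetch` and `sleep` are injectable so the loop can be exercised without network access or real waiting.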

Rollback Procedures

**Automatic Rollback (Health Check Failure):**

If health checks fail after deployment, the platform automatically rolls back to the previous version. No manual intervention required.

**Manual Rollback:**

# View deployment history
atom-cli deployments

# Rollback to specific version
atom-cli rollback <version>

**Database Rollback (if needed):**

# SSH into machine
atom-cli console

# Navigate to backend
cd /app

# Rollback last migration
alembic downgrade -1

# Rollback to specific revision
alembic downgrade <revision_id>

# Verify current revision
alembic current

**⚠️ WARNING:** Database rollbacks can cause data loss if the migration involved data changes. Always back up before rolling back.

Zero-Downtime Deployment Strategy

**Current Setup:**

Rolling deployments (one machine at a time)
Health check grace period (30s main, 90s API)
Minimum machines running (1 main, 1 API)

**Best Practices:**

Deploy during low-traffic hours when possible
Monitor health checks during deployment
Have rollback plan ready
Test migrations locally first
Use feature flags for major changes

---

Monitoring & Observability

Key Metrics to Monitor

Application-Level Metrics

**Request Metrics:**

Request rate (requests per second)
Response times (p50, p95, p99)
Error rate (4xx, 5xx)
Throughput (requests per minute)

**Target Thresholds:**

p95 response time: < 2s (100 concurrent users)
Error rate: < 1%
Request rate: Scale up if sustained > 100 req/s
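
For reference, the p50/p95/p99 figures can be computed from raw latency samples with the nearest-rank method. This is one common percentile definition, shown as a sketch; the monitoring stack may interpolate differently.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarise response times as the three percentiles tracked in this runbook."""
    return {name: percentile(samples_ms, p) for name, p in (("p50", 50), ("p95", 95), ("p99", 99))}
```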

**Business Metrics:**

Agent execution rate (agents per hour)
Graduation exam success rate (%)
Active agents count
Tenant activity (daily active tenants)

Infrastructure Metrics

**ATOM Cloud Metrics:**

CPU usage (%)
Memory usage (%)
Disk usage (%)
Network in/out (bytes per second)

**Target Thresholds:**

CPU usage: Alert if > 80% for 5 minutes
Memory usage: Alert if > 85% for 5 minutes
Disk usage: Alert if > 90%

**Database Metrics (Neon PostgreSQL):**

Connection pool usage (%)
Query performance (slow queries > 1s)
Database size (GB)
Transaction rate (tx per second)

**Target Thresholds:**

Connection pool: Alert if > 80%
Slow queries: Investigate if > 10 per minute
Database size: Alert if > 90% of quota

**Redis Metrics (Upstash):**

Hit rate (%)
Memory usage (%)
Command rate (commands per second)
Connection count

**Target Thresholds:**

Hit rate: > 80% (indicates effective caching)
Memory usage: Alert if > 90%

LLM Provider Metrics

**OpenAI API:**

Request latency (p50, p95)
Error rate (4xx, 5xx)
Rate limit hits (429 responses)
Token usage (tokens per day)

**Target Thresholds:**

Request latency: < 5s p95
Error rate: < 2%
Rate limit hits: Alert if > 10 per minute

Monitoring Dashboards

**ATOM Cloud Console:**

URL: https://console.atomagentos.com
Metrics: CPU, memory, network, requests
Logs: Real-time log streaming
Machines: Machine status and health

**Neon Console:**

Database metrics and performance
Slow query analysis
Connection pool monitoring

**Upstash Console:**

Redis metrics and hit rate
Memory usage and commands
Connection monitoring

Log Aggregation

# View real-time logs
atom-cli logs

# View last N lines
atom-cli logs --lines 100

# Follow logs (tail -f)
atom-cli logs --tail

**Log Levels:**

INFO - Normal operations (startup, requests)
WARNING - Non-critical issues (rate limits, retries)
ERROR - Errors (exceptions, failed requests)
CRITICAL - Critical failures (crashes, data loss)

**Common Log Patterns:**

**Successful Request:**

INFO:     10.0.0.1:12345 - "GET /api/agents HTTP/1.1" 200 OK
INFO:     Request completed in 123ms

**Rate Limit:**

WARNING:  Rate limit exceeded for tenant <tenant_id>
WARNING:  429 Too Many Requests

**Database Error:**

ERROR:    Database connection failed
ERROR:    sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect

**LLM Provider Error:**

ERROR:    OpenAI API request failed
ERROR:    openai.error.RateLimitError: Rate limit exceeded
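
The log patterns above lend themselves to quick triage with a small parser. A sketch assuming the `LEVEL:   message` layout shown in the samples; `summarize_logs` is a hypothetical helper, not part of atom-cli.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

LEVEL_RE = re.compile(r"^(INFO|WARNING|ERROR|CRITICAL):\s+(.*)$")
DURATION_RE = re.compile(r"Request completed in (\d+)ms")


def summarize_logs(lines: Iterable[str]) -> Tuple[Counter, List[int]]:
    """Count lines per log level and collect request durations in milliseconds."""
    levels: Counter = Counter()
    durations: List[int] = []
    for line in lines:
        match = LEVEL_RE.match(line)
        if not match:
            continue  # skip lines that do not follow the expected layout
        level, message = match.groups()
        levels[level] += 1
        timing = DURATION_RE.search(message)
        if timing:
            durations.append(int(timing.group(1)))
    return levels, durations
```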

Alert Thresholds

Monitoring is performed via the **IntegrationMetrics** system, which enqueues on-demand evaluation tasks to **QStash**.

**Critical Alerts (Immediate Action Required):**

| Metric | Threshold | Duration | Action |
|---|---|---|---|
| App health check | > 50% failures | 1 minute | Investigate, restart machines |
| Database connection | > 90% pool usage | 2 minutes | Check for connection leaks |
| Error rate | > 10% | 2 minutes | Check logs, identify root cause |
| CPU usage | > 90% | 5 minutes | Scale up or investigate |
| Memory usage | > 95% | 5 minutes | Scale up or restart |
| Disk usage | > 95% | 5 minutes | Clean up or scale storage |

**Warning Alerts (Monitor Closely):**

| Metric | Threshold | Duration | Action |
|---|---|---|---|
| Response time | > 3s p95 | 5 minutes | Investigate slow queries |
| Error rate | > 5% | 5 minutes | Check logs for patterns |
| CPU usage | > 80% | 10 minutes | Prepare to scale |
| Memory usage | > 85% | 10 minutes | Monitor, prepare to scale |
| Redis hit rate | < 70% | 15 minutes | Review caching strategy |

**Informational Alerts (Track Metrics):**

| Metric | Threshold | Duration | Action |
|---|---|---|---|
| Agent execution rate | < 10/hour | 1 hour | Business as usual |
| Graduation exam rate | < 5/hour | 1 hour | Business as usual |
| Daily active tenants | < 5 | 1 day | Review engagement |
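
Each alert above pairs a threshold with a duration, so a single bad sample should not page anyone. The sketch below shows one way to evaluate a sustained breach. It is not the IntegrationMetrics implementation; `AlertRule` and `is_firing` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    duration_s: int
    above: bool = True  # set False for "lower is worse" metrics such as Redis hit rate


def is_firing(rule: AlertRule, samples: Iterable[Tuple[float, float]], now: float) -> bool:
    """Return True if every sample in the most recent run breaches the threshold
    and that run started at least `duration_s` seconds before `now`.

    `samples` holds (unix_timestamp, value) pairs in any order.
    """
    breach_started = None
    for ts, value in sorted(samples, reverse=True):  # walk from newest to oldest
        if ts > now:
            continue
        breached = value > rule.threshold if rule.above else value < rule.threshold
        if not breached:
            break  # the continuous breach ends here
        breach_started = ts
    return breach_started is not None and now - breach_started >= rule.duration_s
```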

Monitoring Tools

**Built-in Tools:**

Cloud Console (metrics, logs, nodes)
ATOM Cloud CLI (atom-cli commands)
Neon console (database metrics)
Upstash console (Redis metrics)

**External Tools (Optional):**

Sentry (error tracking)
Datadog (APM and metrics)
Grafana (custom dashboards)
PagerDuty (on-call routing)

---

Incident Response

Incident Severity Levels

**SEV-0 (Critical):**

Definition: Complete service outage or data loss
Impact: All users affected
Response Time: Immediate (< 5 minutes)
Examples: All machines down, database unavailable, data corruption

**SEV-1 (High):**

Definition: Major feature degradation or partial outage
Impact: Many users affected, critical paths broken
Response Time: < 15 minutes
Examples: Agent execution failing, auth broken, payment processing down

**SEV-2 (Medium):**

Definition: Minor feature degradation or performance issues
Impact: Some users affected, workarounds available
Response Time: < 1 hour
Examples: Slow response times, non-critical integration down, UI bugs

**SEV-3 (Low):**

Definition: Cosmetic issues or edge cases
Impact: Few users affected, no business impact
Response Time: < 4 hours
Examples: Typos, minor UI glitches, documentation errors
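
The response-time targets can be encoded so that tooling can flag late acknowledgements. A small sketch; the mapping mirrors the four levels above, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Response targets copied from the severity definitions in this runbook.
RESPONSE_TARGETS = {
    "SEV-0": timedelta(minutes=5),
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(hours=1),
    "SEV-3": timedelta(hours=4),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Latest time by which the on-call engineer should have responded."""
    return detected_at + RESPONSE_TARGETS[severity]


def met_response_target(severity: str, detected_at: datetime, acknowledged_at: datetime) -> bool:
    """True if the acknowledgement arrived strictly inside the target window."""
    return acknowledged_at - detected_at < RESPONSE_TARGETS[severity]
```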

Incident Response Process

**1. Detection (Alert Received)**

Alert triggered via monitoring
PagerDuty/notification sent
On-call engineer acknowledges

**2. Assessment (Understand Impact)**

Check dashboards for metrics
Review logs for errors
Determine severity level
Identify affected users

**3. Mitigation (Stop the Bleeding)**

Implement temporary fix
Restore service if possible
Communicate status to users
Document actions taken

**4. Resolution (Fix Root Cause)**

Implement permanent fix
Test in staging
Deploy to production
Verify fix works

**5. Post-Mortem (Learn and Improve)**

Document incident timeline
Identify root cause
Create action items
Update runbook if needed

Common Incidents & Playbooks

Incident 1: Database Connection Failures

**Symptoms:**

500 errors on all endpoints
Logs show "could not connect to server"
Health checks failing

**Detection:**

# Check health endpoint
curl https://[tenant].atomagentos.com/api/health

# View logs for database errors
atom-cli logs | grep -i "database\|connection"

**Mitigation:**

# 1. Check DATABASE_URL secret
atom-cli secrets list

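# 2. Test database connection (runs SELECT 1 through the app's engine)
atom-cli console
python -c "from sqlalchemy import text; from core.database import engine; print(engine.connect().execute(text('SELECT 1')).scalar())"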

# 3. Restart node (connection pool leak)
atom-cli nodes restart <node-id>

# 4. Scale up (connection exhaustion)
atom-cli scale --count 2

**Resolution:**

If connection leak: Fix in code (ensure connections closed)
If pool exhausted: Increase pool_size or scale app
If database issue: Check Neon status page

**Prevention:**

Enable connection pool monitoring
Set connection timeout values
Use connection pooling properly
Regular restarts during maintenance
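
Pool exhaustion is easier to reason about when the pool settings and the resulting connection budget are written down together. The keyword names below are real SQLAlchemy `create_engine` arguments; the default values and both helper functions are illustrative assumptions, not the contents of the project's database module.

```python
def engine_pool_kwargs(pool_size: int = 5, max_overflow: int = 5,
                       pool_timeout: int = 10, pool_recycle: int = 300) -> dict:
    """Keyword arguments intended for sqlalchemy.create_engine(url, **kwargs)."""
    return {
        "pool_size": pool_size,        # connections kept open per process
        "max_overflow": max_overflow,  # extra connections allowed under burst load
        "pool_timeout": pool_timeout,  # seconds to wait for a free connection before erroring
        "pool_recycle": pool_recycle,  # replace connections older than this many seconds
        "pool_pre_ping": True,         # test each connection before use, dropping dead ones
    }


def max_connections(processes: int, pool_size: int, max_overflow: int) -> int:
    """Upper bound on connections the app can open across all processes."""
    return processes * (pool_size + max_overflow)
```

Scaling out multiplies the connection budget: two processes at the defaults above can open up to 20 connections, which should stay below the database's connection limit.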

Incident 2: High Error Rates (> 10%)

**Symptoms:**

Spike in 500 errors
User reports of failures
Error rate alert triggered

**Detection:**

# View error logs
atom-cli logs | grep "ERROR"

# Check recent deployments
atom-cli deployments

# View node status
atom-cli status

**Mitigation:**

# 1. Check if recent deployment caused issue
atom-cli deployments
# Rollback if needed:
atom-cli rollback <version>

# 2. Restart affected node
atom-cli nodes restart <node-id>

# 3. Scale up if resource issue
atom-cli scale --cpu 2 --memory 2048

# 4. Check for downstream dependencies
# (LLM providers, Redis, database)

**Resolution:**

Identify root cause from logs
Fix code issue and deploy
Update runbook if new issue
Add monitoring if needed

**Prevention:**

Comprehensive testing before deploy
Staging environment validation
Gradual rollout (feature flags)
Monitor metrics after deploy

Incident 3: Slow Response Times (> 3s p95)

**Symptoms:**

User complaints about slowness
Response time alert triggered
Dashboard shows elevated latency

**Detection:**

# View recent logs with timing
atom-cli logs --service api --lines 100 | grep "Request completed"

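# Check database for slow queries (requires the pg_stat_statements extension)
atom-cli console
python -c "from sqlalchemy import text; from core.database import engine; print(engine.connect().execute(text('SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5')).fetchall())"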

# Check CPU/memory
atom-cli status

**Mitigation:**

# 1. Scale up (resource constraint)
atom-cli scale --cpu 2 --memory 2048

# 2. Restart node (memory leak)
atom-cli nodes restart <node-id>

# 3. Check database connection pool
# (May need to increase pool_size)

# 4. Check for long-running queries
# (Kill or optimize slow queries)

**Resolution:**

Identify slow queries and optimize
Add database indexes if needed
Implement caching for expensive operations
Optimize LLM calls (reduce tokens, cache results)

**Prevention:**

Regular performance monitoring
Query performance testing
Caching strategy
Load testing before major changes

Incident 4: LLM Provider Outage

**Symptoms:**

Agent execution failing
OpenAI/Anthropic API errors
500 errors on AI-dependent endpoints

**Detection:**

# View logs for LLM errors
atom-cli logs | grep -i "openai\|anthropic\|llm"

# Test LLM provider status
curl https://status.openai.com/
curl https://status.anthropic.com/

**Mitigation:**

# 1. Check API keys (may have expired)
atom-cli secrets list | grep -i "api_key"

# 2. Switch to backup provider
# (Configure the LLM client to use ANTHROPIC_API_KEY instead of OPENAI_API_KEY)

# 3. Disable AI features temporarily
# (Set feature flag to skip LLM calls)

# 4. Use cached responses if available
# (Redis cache may have recent results)

**Resolution:**

Wait for provider to restore service
Implement fallback providers in code
Add retry logic with exponential backoff
Cache LLM responses to reduce dependency

**Prevention:**

Implement multiple LLM providers (BYOK)
Add caching for LLM responses
Implement graceful degradation
Monitor provider status pages
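
Retry with exponential backoff and provider fallback, both listed above, can be combined in a few lines. A minimal sketch: the provider callables stand in for real OpenAI or Anthropic client calls, and all names and default delays are assumptions.

```python
import random
import time
from typing import Callable, Sequence, Tuple


def call_with_backoff(func: Callable[[], object], retries: int = 3, base_delay: float = 1.0,
                      max_delay: float = 30.0, sleep: Callable[[float], None] = time.sleep,
                      retry_on: tuple = (Exception,)):
    """Call `func`, retrying with exponentially growing, jittered delays."""
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries


def call_with_fallback(providers: Sequence[Tuple[str, Callable[[], object]]], **kwargs):
    """Try each (name, callable) in order; return (name, result) from the first that succeeds."""
    errors = {}
    for name, func in providers:
        try:
            return name, call_with_backoff(func, **kwargs)
        except Exception as exc:
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {errors}")
```

In practice `retry_on` would be narrowed to timeouts and 429/5xx responses so that authentication errors fail fast instead of being retried.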

Incident 5: Redis Connection Errors

**Symptoms:**

Rate limiting not working
Session management failing
Cache misses (100% miss rate)
Logs show Redis connection errors

**Detection:**

# Test Redis connectivity
atom-cli console
curl -H "Authorization: Bearer <upstash-rest-token>" $REDIS_URL/ping

# View Redis errors
atom-cli logs | grep -i "redis"

**Mitigation:**

# 1. Check REDIS_URL secret
atom-cli secrets list | grep REDIS

# 2. Test Redis directly
curl -H "Authorization: Bearer <upstash-rest-token>" https://<redis-url>/ping

# 3. Restart app (may be connection pool issue)
atom-cli nodes restart <node-id>

# 4. Operate without Redis (degraded mode)
# (Rate limiting disabled, sessions in DB)

**Resolution:**

If Upstash outage: Wait for service restore
If connection leak: Fix in code
If wrong URL: Update secret
If quota exceeded: Upgrade Upstash plan

**Prevention:**

Monitor Redis hit rate
Test Redis connectivity in health checks
Implement graceful degradation (work without Redis)
Set connection timeouts
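
Graceful degradation for rate limiting can be sketched as a limiter that fails open when its store is unreachable, matching the degraded mode described above. This is not the AbuseProtectionService code; the store is any object with an `incr(key)` method returning the new count, and `MemoryStore` is an in-process stand-in for demonstration.

```python
import time
from typing import Callable


class MemoryStore:
    """In-process stand-in for a Redis-like counter store."""

    def __init__(self) -> None:
        self.counts = {}

    def incr(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


class FixedWindowLimiter:
    """Allow at most `limit` requests per tenant in each `window_s`-second window."""

    def __init__(self, store, limit: int, window_s: int = 60,
                 fail_open: bool = True, clock: Callable[[], float] = time.time) -> None:
        self.store = store
        self.limit = limit
        self.window_s = window_s
        self.fail_open = fail_open
        self.clock = clock

    def allow(self, tenant_id: str) -> bool:
        window = int(self.clock() // self.window_s)
        key = f"ratelimit:{tenant_id}:{window}"
        try:
            count = self.store.incr(key)
        except Exception:
            # Store unreachable: degrade instead of failing the request.
            return self.fail_open
        return count <= self.limit
```

A production version would also set an expiry on each window key so that stale counters do not accumulate.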

Incident 6: Memory Leaks (High Memory Usage)

**Symptoms:**

Memory usage steadily increasing
Machine restarts (OOM killer)
Performance degradation over time

**Detection:**

# Check memory usage
atom-cli status

# View memory over time
atom-cli logs | grep "memory"

# Console access to check process memory
atom-cli console
ps aux | grep python

**Mitigation:**

# 1. Restart node (temporary fix)
atom-cli nodes restart <node-id>

# 2. Scale up (more memory)
atom-cli scale --memory 2048

# 3. Schedule regular restarts
# (Cron job to restart machines daily)

**Resolution:**

Identify memory leak source (profiling)
Fix in code (unclosed connections, large objects)
Implement memory limits (ulimit)
Add memory monitoring alerts

**Prevention:**

Regular memory profiling
Load testing with memory monitoring
Code reviews for memory management
Automated restarts (maintenance window)
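
"Memory usage steadily increasing" can be quantified by fitting a line through periodic memory samples. A sketch using an ordinary least-squares slope; the 20 MB/hour threshold is an arbitrary assumption to tune per service.

```python
from typing import Sequence, Tuple


def growth_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of (seconds, memory_mb) samples, expressed in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t * 3600


def looks_like_leak(samples: Sequence[Tuple[float, float]],
                    threshold_mb_per_hour: float = 20.0) -> bool:
    """Flag sustained growth above the threshold."""
    return growth_per_hour(samples) > threshold_mb_per_hour
```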

Escalation Procedures

**When to Escalate:**

**SEV-0 Incident:** Immediate escalation to senior engineering
**Unknown issue:** Escalate after 30 minutes of troubleshooting
**Customer impact:** Escalate immediately if enterprise customers affected
**Data loss risk:** Escalate immediately, involve database team

**Escalation Contact Order:**

**On-Call Engineer** (Initial response)
**Senior DevOps Engineer** (If unresolved in 30 minutes)
**Engineering Manager** (If customer impact)
**CTO** (If SEV-0 or data loss risk)

**Communication Template:**

SUBJECT: [SEV-X] <Incident Title>

SEVERITY: SEV-0/1/2/3
STATUS: Investigating/Mitigated/Resolved
STARTED: <timestamp>
AFFECTED: <users/services>
CURRENT IMPACT: <description>

CURRENT STATUS:
<What's happening now>

MITIGATION STEPS:
<What we're doing>

NEXT UPDATE: <timestamp>

---

Common Issues & Resolutions

Deployment Issues

Issue 1: Build Failures

**Symptoms:**

ERROR: failed to calculate checksum: "/requirements.txt": not found

**Resolution:**

Check that .dockerignore ends with the exception rule !requirements*.txt (the last matching rule wins, so it must come after any broader exclusion)
Verify Dockerfile paths match build context
Try atom-cli deploy without cache

**Reference:** DEPLOYMENT_TROUBLESHOOTING.md

Issue 2: Migration Failures

**Symptoms:**

Error: release command failed - aborting deployment
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.DuplicateTable)

**Resolution:**

Set release_command = "" in infrastructure.config
Migrations will run in lifespan() instead
Or make migrations idempotent (check if exists)
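
The "check if exists" pattern can be illustrated with a self-contained example. The sketch below uses the standard-library `sqlite3` module purely as a stand-in for PostgreSQL, and the `agents` table is hypothetical. In an Alembic migration the equivalent guard would typically use SQLAlchemy's inspector (`inspect(op.get_bind()).has_table(...)`) before calling `op.create_table`.

```python
import sqlite3


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Return True if a table with this name is already present."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def upgrade(conn: sqlite3.Connection) -> bool:
    """Create the table only if it is missing; return True if work was done."""
    if table_exists(conn, "agents"):
        return False  # already applied: re-running is a no-op instead of DuplicateTable
    conn.execute("CREATE TABLE agents (id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL)")
    return True
```

Running `upgrade` twice succeeds both times, which is the property that keeps a repeated release command from aborting the deployment.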

Issue 3: Health Check Failures

**Symptoms:**

WARNING The app is not listening on the expected address

**Resolution:**

Check app binds to 0.0.0.0 (not 127.0.0.1)
Verify port matches infrastructure.config internal_port
Increase grace_period in infrastructure.config
Check for startup errors in logs

Runtime Issues

Issue 4: Machine Auto-Stops

**Symptoms:**

api-service machines stopped
API returns 503/404
Machines show "stopped" status

**Resolution:**

# Start machine manually
atom-cli nodes start <id>

# Or trigger by making API request
curl https://[tenant].atomagentos.com/api/alive

# Disable auto-stop (if needed)
atom-cli scale --min 2

Issue 5: Rate Limiting Errors

**Symptoms:**

WARNING: Rate limit exceeded for tenant <tenant_id>
429 Too Many Requests

**Resolution:**

Check if tenant exceeded plan quota
Upgrade tenant plan if needed
Check if Redis is working (rate limiting requires Redis)
Reset quota if legitimate issue

Issue 6: Agent Execution Failures

**Symptoms:**

Agent execution returns 500
Logs show governance errors
Episodes not being recorded

**Resolution:**

Check agent maturity level vs action complexity
Verify agent governance cache (may need restart)
Check LLM provider status
Review agent configuration

Database Issues

Issue 7: Connection Pool Exhaustion

**Symptoms:**

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection pool exhausted

**Resolution:**

# 1. Restart app (frees connections)
atom-cli nodes restart <id>

# 2. Scale up (more connections)
atom-cli scale --count 2

# 3. Increase pool_size in code
# (Edit database.py and redeploy)

Issue 8: Slow Query Performance

**Symptoms:**

Database queries > 1s
API endpoints slow
Logs show slow query warnings

**Resolution:**

# 1. Identify slow queries
atom-cli console
# Check Neon console for slow query log

# 2. Add indexes
alembic revision -m "add indexes"
# Edit migration to add indexes

# 3. Optimize query
# (Use selectinload for eager loading, add pagination, etc.)

Integration Issues

Issue 9: OAuth Callback Failures

**Symptoms:**

OAuth redirects fail
Token storage errors
Integration state not updating

**Resolution:**

Check callback URL matches Cloud app URL
Verify the OAuth client ID and client secret stored in secrets
Check tenant isolation in integration tables
Review integration logs

Issue 10: Stripe Webhook Failures

**Symptoms:**

Webhook returns 500
Subscription events not processed
Billing not updated

**Resolution:**

Verify Stripe webhook secret
Check webhook signature validation
Test webhook endpoint with Stripe CLI
Review tenant_id extraction
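
For checking signature validation by hand, the sketch below follows Stripe's documented scheme: the `Stripe-Signature` header carries `t=<timestamp>` and one or more `v1=<hex digest>` entries, and the digest is HMAC-SHA256 over `<timestamp>.<raw body>` keyed with the endpoint secret. The function name is hypothetical, and production code would normally rely on Stripe's official library instead.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(payload: bytes, header: str, secret: str,
                            tolerance_s: int = 300, now: Optional[float] = None) -> bool:
    """Validate a Stripe-Signature header against the raw request body."""
    parts = [item.strip().split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((value for key, value in parts if key == "t"), None)
    candidates = [value for key, value in parts if key == "v1"]
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False  # too old or too far in the future: possible replay
    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 entry in the header.
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)
```

The payload must be the raw request bytes; a body that was parsed and re-serialized as JSON will not match the signature.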

---

Data Backup & Recovery

Backup Strategy

**Database Backups (Neon PostgreSQL):**

**Automated:** Neon provides continuous backups
**Retention:** 7 days (point-in-time recovery available)
**Frequency:** Continuous (WAL logs)
**Location:** Neon-managed storage

**Storage Backups (AWS S3):**

**Automated:** S3 versioning enabled
**Retention:** 30 days
**Frequency:** Per object upload
**Location:** Same region as S3 bucket

**Redis Backups (Upstash):**

**No automatic backups** (ephemeral cache)
**Data can be rebuilt from database**
**Critical:** Rate limits, sessions (can be recreated)

Backup Verification

**Weekly Backup Checks:**

# 1. List recent backups (Neon console)
# Navigate to: Neon Console > Database > Backups

# 2. Test point-in-time recovery
# (Create clone database from backup)

# 3. Verify S3 versioning
aws s3api list-object-versions --bucket atom-saas

# 4. Check Redis persistence
# (No backups - data is cache only)

Recovery Procedures

Database Recovery

**Scenario 1: Restore from Backup**

# 1. Identify backup timestamp
# (Neon Console > Backups)

# 2. Create recovery database
# (Neon Console > Create Branch > Point in Time)

# 3. Update DATABASE_URL secret
atom-cli secrets set DATABASE_URL=<new-url>

# 4. Restart app to use new database
atom-cli nodes restart <id>

# 5. Verify data integrity
curl https://[tenant].atomagentos.com/api/health

**Scenario 2: Rollback Migration**

# 1. Access console
atom-cli console

# 2. Navigate to app directory
cd /app

# 3. Rollback last migration
alembic downgrade -1

# 4. Verify current revision
alembic current

# 5. Exit and restart node
exit
atom-cli nodes restart <id>

Storage Recovery (S3)

**Scenario 1: Restore Deleted Object**

# 1. List object versions
aws s3api list-object-versions \
  --bucket atom-saas \
  --prefix "tenant-abc/file.pdf"

# 2. Restore specific version
aws s3api get-object \
  --bucket atom-saas \
  --key "tenant-abc/file.pdf" \
  --version-id <version-id> \
  restored-file.pdf

# 3. Upload restored object
aws s3 cp restored-file.pdf \
  s3://atom-saas/tenant-abc/file.pdf
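
Choosing which version to restore can be scripted against the JSON that `list-object-versions` returns. A sketch assuming the standard response shape (`Versions` entries with `Key`, `VersionId`, and `LastModified`) and timestamps in one consistent ISO 8601 format, so that string comparison orders them; the function name is hypothetical.

```python
from typing import Optional


def latest_restorable_version(listing: dict, key: str) -> Optional[str]:
    """Return the VersionId of the newest real version of `key`.

    Delete markers are listed separately under "DeleteMarkers" and are ignored,
    so a deleted object still resolves to its last stored content.
    """
    versions = [v for v in listing.get("Versions", []) if v["Key"] == key]
    if not versions:
        return None
    newest = max(versions, key=lambda v: v["LastModified"])
    return newest["VersionId"]
```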

Redis Recovery (Cache Rebuild)

**Scenario 1: Redis Cache Cleared**

# 1. Redis data is cache-only (no recovery needed)
# Data will be rebuilt on next request

# 2. Warm up critical caches
# (Trigger API calls to rebuild cache)

# 3. Monitor hit rate
# (Should improve over time)

Disaster Recovery

**Complete Site Failure:**

**Scenario:** All Cloud Nodes down, data center outage

**Recovery Steps:**

**Assess Impact**

Check Cloud Status page
Determine scope of outage

**Restore Database**

Create new database from backup
Update DATABASE_URL secret

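**Redeploy App**

Redeploy web-platform and api-service (atom-cli deploy)
Confirm nodes reach the "running" state (atom-cli nodes list)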

**Restore S3 Data**

S3 is separate (likely unaffected)
Verify S3 connectivity

**Verify Services**

Test health endpoints
Smoke test critical paths
Monitor metrics

**RTO (Recovery Time Objective):** 2-4 hours

**RPO (Recovery Point Objective):** 5 minutes (Neon continuous backups)
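
After a recovery or a drill, the outcome can be scored against these objectives. A small sketch; it uses the upper bound of the RTO range (4 hours), and the function name and report fields are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(minutes=5)  # maximum tolerated data loss
RTO = timedelta(hours=4)    # upper bound of the 2-4 hour recovery target


def drill_report(outage_start: datetime, last_recoverable_point: datetime,
                 service_restored: datetime) -> dict:
    """Compare measured data loss and downtime with the stated objectives."""
    data_loss = outage_start - last_recoverable_point
    downtime = service_restored - outage_start
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= RPO,
        "rto_met": downtime <= RTO,
    }
```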

---

Security & Compliance

Security Layers

**Multi-Tenancy Isolation**

Row-Level Security (RLS) on all tables
tenant_id required for all queries
Subdomain-based tenant routing
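
Subdomain-based routing depends on extracting the tenant from the Host header safely. The sketch below is an illustration, not the platform's router: the base domain comes from the URLs in this runbook, while the reserved-name list and the label rules are assumptions.

```python
import re
from typing import Optional

BASE_DOMAIN = "atomagentos.com"
# Subdomains used by platform services elsewhere in this runbook, never tenants.
RESERVED = {"console", "status", "docs", "community"}
# A single DNS label: letters, digits, inner hyphens, at most 63 characters.
TENANT_RE = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$")


def tenant_from_host(host: str) -> Optional[str]:
    """Return the tenant slug for '<tenant>.atomagentos.com', or None if invalid."""
    hostname = host.split(":", 1)[0].lower().rstrip(".")  # drop port and trailing dot
    suffix = "." + BASE_DOMAIN
    if not hostname.endswith(suffix):
        return None
    label = hostname[: -len(suffix)]
    if not TENANT_RE.match(label) or label in RESERVED:
        return None  # rejects nested subdomains, empty labels, and reserved names
    return label
```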

**Authentication & Authorization**

NextAuth.js for session management
Role-based access control (RBAC)
Agent maturity-based permissions

**Network Security**

HTTPS enforced (TLS 1.2+)
CORS configured for allowed origins
Rate limiting (AbuseProtectionService)

**Data Security**

Encrypted at rest (Neon, S3)
Encrypted in transit (TLS)
Tenant API keys isolated

**Application Security**

Input validation (Pydantic schemas)
SQL injection prevention (SQLAlchemy)
XSS prevention (React escaping)

Security Monitoring

**Daily Checks:**

Review error logs for security issues
Check for failed auth attempts
Monitor rate limit violations

**Weekly Checks:**

Review access logs for anomalies
Audit tenant permission changes
Check for new vulnerabilities

**Monthly Checks:**

Run security scans (npm audit, pip-audit)
Review third-party dependencies
Update runbook with new threats

Security Incidents

**Incident Types:**

**Unauthorized Access**

Symptoms: Suspicious login attempts, data breaches
Response: Revoke sessions, force password reset
Prevention: MFA, rate limiting, audit logging

**Data Exposure**

Symptoms: Sensitive data in logs, unauthorized queries
Response: Rotate secrets, audit logs
Prevention: Log redaction, query validation
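
Log redaction can be as simple as masking known secret shapes before a message is written. A sketch covering the key prefixes shown in the Environment Variables section (`sk-`, `sk-ant-`, `sk_live_`) and passwords embedded in PostgreSQL URLs; the patterns are illustrative and not exhaustive.

```python
import re

# Password portion of a postgresql://user:password@host URL.
DB_URL_RE = re.compile(r"(postgresql://[^:/\s]+:)[^@\s]+@")
# API keys: the longer prefixes are listed first so they win over plain "sk-".
API_KEY_RE = re.compile(r"\b(sk-ant-|sk_live_|sk-)[A-Za-z0-9_-]+")


def redact(message: str) -> str:
    """Mask secrets in a log message while keeping enough context to debug."""
    message = DB_URL_RE.sub(r"\1[REDACTED]@", message)
    return API_KEY_RE.sub(r"\1[REDACTED]", message)
```

Keeping the prefix visible shows which kind of key leaked without exposing its value.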

**DDoS Attack**

Symptoms: Spike in requests, rate limit alerts
Response: Enable Cloud DDoS protection
Prevention: Rate limiting, CAPTCHA

Compliance

**GDPR Compliance:**

Right to erasure: /api/users/[id]/delete endpoint
Data export: /api/users/[id]/export endpoint
Consent management: Tenant settings

**SOC 2 Compliance:**

Audit logging: All actions logged
Access controls: RBAC enforced
Data encryption: At rest and in transit
Incident response: Documented procedures

---

Maintenance Windows

Scheduled Maintenance

**Weekly Maintenance (Sundays 2-4 AM UTC):**

Database maintenance (Neon)
Machine restarts (memory leaks)
Log cleanup
Backup verification

**Monthly Maintenance (First Sunday 2-6 AM UTC):**

Dependency updates
Security patches
Performance optimization
Runbook updates

**Quarterly Maintenance:**

Major version upgrades
Architecture review
Cost optimization
Disaster recovery drill
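
Automation such as scheduled restarts can guard itself with a check that the current time falls inside a maintenance window. A sketch of the weekly and monthly windows defined above; the function names are hypothetical, and inputs are expected to be timezone-aware datetimes.

```python
from datetime import datetime, timezone


def in_weekly_window(moment: datetime) -> bool:
    """Sundays, 02:00 to 04:00 UTC."""
    utc = moment.astimezone(timezone.utc)
    return utc.weekday() == 6 and 2 <= utc.hour < 4


def in_monthly_window(moment: datetime) -> bool:
    """First Sunday of the month, 02:00 to 06:00 UTC."""
    utc = moment.astimezone(timezone.utc)
    # The first Sunday always falls on day 1 through 7.
    return utc.weekday() == 6 and utc.day <= 7 and 2 <= utc.hour < 6
```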

Maintenance Process

**Before Maintenance:**

Notify users 24 hours in advance
Create backup (verify integrity)
Set maintenance mode (if needed)
Document rollback plan

**During Maintenance:**

Execute maintenance tasks
Verify services after changes
Monitor metrics closely
Have rollback ready

**After Maintenance:**

Remove maintenance mode
Smoke test critical paths
Update runbook if changed
Post-incident report (if issues)

---

Emergency Contacts

On-Call Rotation

**Primary On-Call:**

**Name:** [On-Call Engineer]
**Phone:** [Phone Number]
**Email:** [Email]
**Hours:** 24/7

**Escalation:**

**Senior DevOps:** [Name, Phone, Email]
**Engineering Manager:** [Name, Phone, Email]
**CTO:** [Name, Phone, Email]

Service Providers

**ATOM Cloud Support:**

**Status Page:** https://status.atomagentos.com
**Support:** https://community.atomagentos.com
**Docs:** https://docs.atomagentos.com

**Neon Database:**

**Status Page:** https://status.neon.tech
**Support:** support@neon.tech
**Docs:** https://neon.tech/docs

**Upstash Redis:**

**Status Page:** https://status.upstash.com
**Support:** support@upstash.com
**Docs:** https://upstash.com/docs

**AWS (S3, SES):**

**Status Page:** https://status.aws.amazon.com
**Support:** AWS Support Center
**Docs:** https://docs.aws.amazon.com

**Stripe:**

**Status Page:** https://status.stripe.com
**Support:** https://support.stripe.com
**Docs:** https://stripe.com/docs

Critical Services

**Monitoring & Alerting:**

Cloud Console: https://console.atomagentos.com
Neon Console: https://console.neon.tech
Upstash Console: https://console.upstash.com

**Emergency Access:**

# Console access (emergency only)
atom-cli console

# Emergency restart
atom-cli nodes restart --all

# Emergency rollback
atom-cli rollback

---

Appendices

Appendix A: ATOM Cloud CLI Cheat Sheet

# Apps
atom-cli list
atom-cli status
atom-cli info

# Deployments
atom-cli deploy
atom-cli deployments
atom-cli rollback <version>

# Nodes
atom-cli nodes list
atom-cli nodes start <id>
atom-cli nodes stop <id>
atom-cli nodes restart <id>

# Logs
atom-cli logs
atom-cli logs --lines 100
atom-cli logs --tail

# Secrets
atom-cli secrets list
atom-cli secrets set KEY=value
atom-cli secrets unset KEY

# Console
atom-cli console

# Scaling
atom-cli scale --count 2
atom-cli scale --cpu 2 --memory 2048

# Regions
atom-cli regions list
atom-cli regions set iad,ewr

Appendix B: Database Commands

# Migrations
alembic upgrade head
alembic downgrade -1
alembic current
alembic history
alembic revision -m "description"

# Database connection
psql $DATABASE_URL
\dt # List tables
\d table_name # Describe table
\q # Quit

# Backup
pg_dump $DATABASE_URL > backup.sql

# Restore
psql $DATABASE_URL < backup.sql

Appendix C: Monitoring Queries

**Slow Queries (Neon Console):**

SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

**Connection Count:**

SELECT count(*) FROM pg_stat_activity;

**Table Sizes:**

SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

**Locks:**

SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock';

Appendix D: Runbook Maintenance

**Version History:**

v1.0 (2026-02-22): Initial creation
Future updates: Document changes here

**Update Process:**

Make changes to this document
Update version number and date
Add summary of changes to version history
Commit to repository
Notify team of updates

**Review Schedule:**

Monthly: Review for accuracy
Quarterly: Major updates and improvements
Annually: Complete rewrite if needed

---

**Document Owner:** DevOps Team

**Last Reviewed:** 2026-02-22

**Next Review:** 2026-03-22

---

**End of Production Runbook**
